Abstract

We propose a novel knowledge-rich approach to measuring the
similarity between a pair of words. The algorithm is tailored to
Bulgarian and Russian and takes into account the orthographic
and the phonetic correspondences between the two Slavic
languages: it combines lemmatization, hand-crafted transformation
rules, and weighted Levenshtein distance. The experimental
results show an 11-pt interpolated average precision of 90.58%,
which represents a sizeable improvement over two classic
competing approaches.

Keywords 

Orthographic similarity, phonetic similarity, cross-lingual 
transformation. 

1. Introduction 

We propose an algorithm that measures the extent to
which a Bulgarian and a Russian word are perceived as
similar by a person who is fluent in both languages.
Leaving aside full orthographic identity, we assume
that words with different orthography can also be
perceived as similar when they have the same or a similar
stem and inflections, as in the Bulgarian word
афектирахме and the Russian аффектировались (both
meaning 'we were affected').

Bulgarian and Russian are closely related Slavonic
languages with rich morphology, which motivates us to
study the typical orthographic correspondences between
their lexical entries (conditioned phonetically and
morphologically). We use these correspondences to formulate
and apply transformation rules that bring a Russian word
close to its Bulgarian reading and vice versa. Our algorithm
for measuring the similarity between Bulgarian and Russian
words first reduces the Russian word to an intermediate
Bulgarian-sounding form and then compares it
orthographically to the Bulgarian word. The algorithm starts by
transliterating the Russian word with the Bulgarian
alphabet, and then transforms some typical Russian
morphemes and word parts (e.g., prefixes, suffixes,
endings, etc.) to their Bulgarian counterparts. Since both
Bulgarian and Russian are highly inflectional languages,
lemmatization is used to convert the wordforms to their
lemmata in order to reduce the differences at the
morphological level. Finally, the orthographic similarity is
measured using a modified Levenshtein distance with
letter-specific substitution weights.



2. Method 

The normalization of the Bulgarian and the Russian
words into corresponding intermediate forms has phonetic
and morphological motivation and is performed as a
sequence of steps, which are described below.

2.1. Transliteration from Cyrillic to Cyrillic 

In a strict linguistic sense, transcription is the process of 
transition from sounds to letters, i.e., from speech to text; 
it is carried out generally in a monolingual context. In a 
bilingual context, the notion of transliteration is used to
denote the transition of sounds and their letter
correspondences in one language to letters in another language. The
term transliteration is commonly used for the transition 
of letters when the two languages use different alphabets. 
In this paper, we deal with transliteration since we work 
with written texts. 

The linguistic objective of our investigation is to
introduce more formal criteria to the investigation of possible
cognates between Russian and Bulgarian. By cognates 
we mean words with equal or close orthography denoting 
the same meaning; words with equal/close orthography 
but different meaning are false cognates/friends. For their 
further investigation in multilingual research, we need to 
define the exact expression of that identity/closeness by 
particular metrics and procedures. 

For a pair of languages from different families, the
source of cognates is borrowing between them or from a
third language. Besides borrowing, an essential source of
cognates in related languages is their common
protolanguage. However, in the historical development of both
languages, three factors lead to different grapheme shapes
for fully identical words: (1) language-specific phonetic
laws and resulting changes, (2) settings of the spelling
systems regulating the sound-letter transition, and (3)
divergence in the grammatical systems and the
grammatical formatives.

2.1.1 Full coincidence (equality) of letters 

Both Russian and Bulgarian use the Cyrillic alphabet in
their writing systems, but Russian uses two letters not
present in Bulgarian: ы and э. Most other letters generally
show a full coincidence with some exceptions to be listed
in the following subsections. The list below presents the
full identity of Cyrillic letters in both languages in the
cognates: азбука - азбука, буква - буква, воля - воля,
гипс - гипс, дух - дух, езда - езда, жена - жена, закон
- закон, истина - истина, йод - йод, кипарис -
кипарис, лак - лак, монета - монета, нож - нож,
опора - опора, пост - пост, река - река, com - com,
том - том, ум - ум, фазан - фазан, хитин - хитин,
царь - цар, чай - чай, шум - шум, щит - щит, юг -
юг, яхта - яхта.

As the above list shows, the full identity of the 
grapheme shape of cognates is manifested mainly when 
the transformed letter is in initial position. 

2.1.2 Regular letter transitions 

Replacing Russian letters that are missing in the Bulgarian
alphabet. The transitions discussed here stem from
historic differences in the phonetic and the spelling
systems of the two languages. Bulgarian and Russian differ
in their contemporary phonetic systems mainly at the level
of pronunciation, in the distinction of soft and hard
consonants. The Russian-specific letters ы and э serve to
denote the variant of a 'hard consonant+и/е' while in
Bulgarian all consonants preceding и and е are soft. This
basic difference of the phonetic systems gives us the
regular correspondences ы-и and э-е in all
Russian-Bulgarian cognates containing these two letters, e.g.,
рыба - риба, поэт - поет.

Removing a Russian letter. Another regular phonetic
difference between the two languages, which is also related
to the opposition soft/hard, is the allowed softness of a
consonant preceding another consonant (пальто) or in
final position (шесть). Such phonetic combinations are
not allowed in Bulgarian: see the corresponding палто
and шест. This regularity allows us to remove all
Russian ь in these positions in the initial stage of the
process of cognate comparison.

Partial regularity of the letter transitions. In non-initial
positions, other not so regular but repeated letter
correspondences can be observed, e.g., е-я in хлеб-хляб, е-ъ in
серп-сърп, о-ъ in сон-сън, у-ъ in муж-мъж, etc. The
recurrence of such transitions is due to the specific
development of the spelling systems in the two languages.
One such example is the disappearance of some Old
Slavic letters and their regular replacement with different
letters in Russian and Bulgarian. The above-mentioned
change у-ъ is due to the disappearance of an Old Slavic
letter called 'big yus' and its regular replacement by
different vowels in all contemporary Slavic languages.
The transition is only partially regular since not all
occurrences of the letter have the same etymological origin.

2.2. Transformations of n-grams 

The sound-letter transition licensed by the spelling
rules of the two languages is language-specific as well. This
is observed at the level of the grapheme composition of
the full cognates, i.e., those that are borrowed from third
languages or that are identical morphologically.

Transformations originating from spelling. 



A fundamental difference between Russian and Bulgarian
spellings is the treatment of double consonants.
Russian allows them in every part of the word structure,
while in Bulgarian they are only possible at the morpheme
boundary. Thus, all words borrowed from third
languages keep their double consonants in Russian, but
lose them in Bulgarian, e.g., процесс - процес, аффект
- афект, etc. In this way, a regular transition from a double
to a single consonant can be formulated for all double
consonants, with the following stipulation of grammatical
origin.

In words of Slavic origin, consonant doubling occurs
mainly at the morpheme boundary, but in Russian the
phenomenon is more frequent since Russian spelling rules
are more 'phonetic'. For example, they reflect the change
voiced-voiceless for all prefixes ending with з and
preceding the initial с of the next morpheme. Bulgarian
spelling is more 'morphological' and conservative; it
keeps the з in writing, although it is voiceless in
pronunciation, e.g., рассуждение - разсъждение,
бессмертный - безсмъртен, etc. This voiced-voiceless
alternation in the final prefix position is only
valid for the pair з-с. Thus, the Bulgarian-Russian
transition зс-сс can be formulated as regular for prefixes
only and cannot be viewed as universal for other parts
of the word, e.g., кавказский - кавказки.

Next, the following general question arises in treating
double consonant correspondences: if we want to stay
in the domain of uni- and bigram transformations,
removing the second consonant in Russian can be ambiguous:
поддержать - поддържам, but буддист - будист,
вводить - въвеждам, but раввин - равин. The legal
consonant doublings in Bulgarian can only be identified in
a larger context - a window of up to five letters,
containing the prefix and the next consonant, as in предд, надд,
подд, изз, разз, etc., where the second consonant should
be preserved. Note that these exceptions from the rule are
only valid for double д, з and в (the final letter of prefixes)
and for н (the first letter of the affix н), e.g., непременно -
непременно, but аннотация - анотация.

Transformations of morphological origin. 

In addition to the divergent development of phonetic
and spelling systems, the two languages have developed
different grammatical systems, both at a systemic and at a
morphemic level - different categories with different graphemic
expressions. That divergence leads to different grapheme
shapes for words that are lexically conceived as cognates,
e.g., жены - жената. The difference is manifested
in the final part of the word, which consists of affixes and
an ending and is related to grammatical forms.

The transformations are made in two directions and 
for both languages. They can consist of removal of a 
letter sequence or its transformation. 

1. Removing agglutinative morphemes.

Each of the two languages has one agglutinative
mechanism of word formation (but for different parts of
speech) - the reflexive morpheme ся and сь in Russian
verb conjugation and the postpositioned article in
Bulgarian nominal inflection (for nouns and adjectives). The
corresponding grammatical meanings are expressed in the
twin language by other means (the article is totally
missing in Russian and the reflexivity of verbs is
expressed by a lexical element in Bulgarian - the particle се).
Thus, removing these morphemes is the first step in the
process of conversion to an intermediate form, e.g.,
веселиться - веселить, квадратът - квадрат. Note
that the Russian agglutinative morphemes ся/сь at the end
of the word are unambiguous: all 212,000 wordforms
with the ending ся in our Russian grammatical dictionary
are reflexive verb forms. This is not the case with the
Bulgarian article, where only removing the morpheme -ът
for the masculine is unambiguous, while removing та, ят
and other article morphemes can trim the stem, e.g.,
жена-та, but квадрат-а. We intentionally do not derive a
transformation rule from the last correspondence.

Removing Bulgarian articles depends on the accepted
conception about the place of lemmatization in the
algorithm: should we measure the orthographic similarity for
all four members of the language pair (lemmata and
wordforms), or should we measure the similarity at the
lexical level only (the lemmata)? In the latter case, no
removal is necessary (see Section 2.3).

2. Transforming ending strings.

There is a big group of adjectives in the two languages
derived from other parts of speech and formed with
the suffix н and an adjectival ending, e.g., шум -
шумный, шум - шумен. When the adjective is derived
from a noun ending with н, we get a doubled н in the
Russian lemma and in the Bulgarian wordforms, e.g.,
гарнизон - гарнизонный and гарнизон - гарнизонни.
Another regular correspondence is manifested in the word
derivation with the suffix ск. All these combinations of н
/ нн / ск and different adjectival endings give the
correspondences shown in Table 1.



Russian ending   Bulgarian ending   Example
-нный            -нен               военный → военен
-ный             -ен                вечный → вечен
-нний            -нен               ранний → ранен
-ний             -ен                вечерний → вечерен
-ский            -ски               вражеский → вражески
-ый              -и                 стрелковый → стрелкови
-нной            -нен               стенной → стенен
-ной             -ен                родной → роден
-ой              -и                 деловой → делови

Table 1: Transforming Russian adjectives to Bulgarian.



For verbs, there are some regularities in the
correspondences of the endings of the Russian infinitive and the
Bulgarian verb's main form in first person singular. Table
2 below shows some examples.



Russian ending   Bulgarian ending   Example
-овать           -ам                декорировать → декорирам
-ить, -ять       -я                 бродить → бродя; блеять → блея
-ать             -ам                давать → давам
-уть             -а                 заснуть → засна
-еть             -ея                белеть → белея

Table 2 - Transformation of Russian verbs to Bulgarian.



Concerning the transformation of endings, it is
important to note that two linguistic problems are interrelated
here: (1) the formal identification of the morpheme
boundary, and (2) the correct correspondence with the
Bulgarian ending. The existing ambiguity in resolving
these two problems requires serious statistical
investigation before the rules can be formulated.

With ambiguity not taken into account, the proposed
transformation rules for Russian word endings could
sometimes generate the wrong Bulgarian wordform, e.g.,
висеть could become висея, while the correct Bulgarian
form is вися. In order to limit the negative impact of that,
we measure the similarity (1) with and (2) without
applying rules for lemmatization; we then return the
higher value of the two.

2.3. Lemmatization 

Bulgarian and Russian are highly inflectional languages,
i.e., they use a variety of endings to express the different
forms of the same word. When measuring orthographic
similarity, endings could cause major problems since they
can make two otherwise very similar words appear
somewhat different. For example, the Bulgarian word
отправената ('the directed', a feminine adjective with a
definite article) and the Russian word отправленному
('the directed', a masculine adjective in dative case)
exhibit only about 50% letter overlap, but, if we ignore
the endings, the similarity between them becomes much
bigger. Thus, if our algorithm could safely ignore word
endings when comparing words, it might perform better.

If we could remove the ending, the similarity would
be measured using the stem, which is the invariable part
of the word. Unfortunately, both the ending as a letter
sequence and the location of the morpheme boundary are
quite ambiguous in both languages. Thus, we need to
lemmatize the text, i.e., convert the word to its main form,
the lemma. If every member of the pair of candidate
cognates from L1 and L2 is represented by a wordform
(WF) and its lemma (L), then we could compare: L1 with
L2, WF1 with WF2, L1 with WF2 and WF1 with L2.
Considering these four options, we can get a better
estimation for the similarity not only between close
wordforms like the Bulgarian отправената and the
Russian отправленному, which look different
orthographically but have very close lemmata, but also between
very different words like the Bulgarian къпейки
('bathing', a gerund) and the Russian копейки ('copeck',
plural feminine noun).

The lemmatization of the Bulgarian and the Russian
words can be done using specialized dictionaries. In the
present work, we use two large grammatical dictionaries
that contain words, their lemmata, and some
grammatical information.

2.4. Transformation Weights 

Let us now come back to the transliteration rules and to
the next steps in our algorithm. There are orthographic
correspondences between candidate cognates that are not
as indisputable as the general rules, but are still observed
in the development of the languages, at least for ones with
a proven etymological basis. As was shown above, the
regular correspondences between the languages can be due
to phonetic and spelling reasons. Besides the
unconditional letter transitions described above, not so regular
ones occur in several cases, and their existence can be
taken into account when constructing the weight scale for
measuring similarity.

A general principle when building a weight scale is
that the correspondences between letters denoting
consonants and vowels (hereinafter 'vowels' and 'consonants'
only) should be measured separately. The maximal
orthographic distance between different letters is 1 (as for
а-ц), and the maximal similarity has weight 0 (as for а-а).
All weight values between 0 and 1 are assigned to letter
correspondences that exist in a non-regular way in some
cognates (the above-mentioned correspondence у-ъ was
due to etymological reasons). Another general assumption
is that consonants and vowels with similar sequences of
distinctive phonetic features (differing only in the place
of articulation or in the presence/absence of voice, e.g.,
б-в, б-п) have a lower weight distance. The same is valid for
pairs of letters denoting a regular phonetic change,
e.g., reduction (as in а-ъ, о-у) or softening of the
preceding consonant (as in у-ю, а-я). Regular correspondences
observed in a limited lexical sector (e.g., borrowed from
Latin and Greek) such as г-х also have a lower distance.

Table 3 shows the letter transformation weights,
which can be used to measure the orthographic similarity
after the Bulgarian and Russian words have been
transliterated to a subset of the Cyrillic alphabet.

The weights w(a, b) are used to transform the letter a
into the letter b and vice versa. This weight function w is
symmetric by definition, i.e., w(a, b) = w(b, a). All other
weights not given in Table 3 are equal to 1.

In order to write the Russian words in the modified
Bulgarian alphabet used in Table 3, we make the following
preliminary transformations for all Russian words:

э → е; ы → и; ь → (empty letter); ъ → (empty letter)
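The preliminary transformations above can be sketched in a few lines of Python. This is only an illustration: the function name `transliterate_ru` and the lower-casing step are assumptions made for the example, not part of the described method.

```python
# Preliminary Russian -> "modified Bulgarian alphabet" transliteration.
# The four mappings are the ones listed in the text; the function name and
# the lower-casing are assumptions made for this sketch.
TRANSLITERATION = str.maketrans({
    "э": "е",  # Russian-specific vowel, regular correspondence э-е
    "ы": "и",  # Russian-specific vowel, regular correspondence ы-и
    "ь": "",   # soft sign is dropped ("empty letter")
    "ъ": "",   # hard sign is dropped ("empty letter")
})


def transliterate_ru(word: str) -> str:
    """Rewrite a Russian word with the letters used in Table 3."""
    return word.lower().translate(TRANSLITERATION)
```

For the examples given in Section 2.1.2, `transliterate_ru("рыба")` yields `риба` and `transliterate_ru("пальто")` yields `палто`.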

Table 3 captures the match between letters and the
sounds they denote in Bulgarian and Russian. It further
assigns weights to those letter transformations that have
been phonetically justified.


Letter   Substitution weights
а        [four entries illegible]; w(а, ъ)=0.5; w(а, ю)=0.8; w(а, я)=0.5
б        w(б, в)=0.8; w(б, п)=0.6
в        [illegible]
г        w(г, х)=0.5
д        [illegible]
е        w(е, и)=0.6; w(е, о)=0.7; w(е, у)=0.8; w(е, ъ)=0.5;
         w(е, ю)=0.8; w(е, я)=0.5
ж        [two entries illegible]
з        [illegible]
и        w(и, о)=0.8; [three entries illegible];
         w(и, ю)=[illegible]; w(и, я)=0.7
й        [two entries illegible]
к        [two entries illegible]
м        [illegible]
о        [three entries illegible]; w(о, я)=0.8
п        w(п, ф)=0.8; w(п, х)=0.9
с        w(с, ц)=0.6; w(с, ш)=0.9
т        w(т, ф)=0.8; w(т, х)=0.9; w(т, ц)=0.9
у        w(у, ъ)=0.5; w(у, ю)=0.6; w(у, я)=0.8
ф        w(ф, [illegible])=0.8
х        w(х, [illegible])=0.9
ц        w(ц, ч)=0.8
ч        w(ч, ш)=0.9
ъ        w(ъ, ю)=0.8; w(ъ, я)=0.8
ю        w(ю, я)=0.8

Table 3 - Letter substitution weights.



3. The MMEDR Algorithm 

The MMEDR algorithm (modified minimum edit distance
ratio) measures the orthographic similarity between a pair
of Bulgarian and Russian words using some general
phonetically and morphologically conditioned
correspondences between the letters of the two languages
in order to estimate the extent to which the two words
would be perceived as similar by people fluent in both
languages. It returns a value between 0 and 1, where
values close to 1 express very high similarity, while 0 is
returned for completely dissimilar words. The algorithm
has been tailored for Bulgarian and Russian and thus is
not directly applicable to other pairs of languages.
However, the general approach can be easily adapted to other
languages: all that has to be changed are the rules
describing the phonetic and the morphological correspondences.

The MMEDR algorithm in steps: 



1. Lemmatize the Bulgarian word.

2. Lemmatize the Russian word. 

3. Transform the Russian word's ending. 

4. Transliterate the Russian word. 

5. Remove some double consonants in the Russian 
word. 

6. Calculate the modified Levenshtein distance using 
suitable weights for letter substitutions. 

7. Normalize and calculate the MMEDR value. 

The algorithm first tries to rewrite the Russian word
following Bulgarian letter constructions. As a result, both
words are transformed into a special intermediate form
and then are compared orthographically using
Levenshtein distance with suitable weights for individual letter
substitutions. The above general algorithm is run in eight
variants with each of steps 1, 2 and 3 being included or
excluded, and the largest of the eight resulting values is
returned. A description of each step follows below.

3.1. Lemmatizing Bulgarian and Russian 
Words 

The Bulgarian word is lemmatized using a grammatical
dictionary of Bulgarian as described in Section 2.3. If the
dictionary contains no lemmata for the target word, the 
original word is returned; if it contains more than one 
lemma, we try using each of them in turn and we choose 
the one yielding the highest value in the MMEDR 
algorithm. The Russian word is lemmatized in the same 
way, using a grammatical dictionary of Russian. 

3.2. Transforming the Russian Ending 

At this step, we transform the endings of the Russian
word according to Tables 1 and 2 and we remove the
agglutinative suffix ся:

нный → нен; ный → ен; нний → нен; ний → ен; ий → и;
ый → и; нной → нен; ной → ен; ой → и; ский → ски;
ься → ь; овать → ам; ить → я; ять → я; ать → ам;
уть → а; еть → ея

The substitution rules are applied only if the left-hand
side letter sequence is at the end of the word. Rules are
applied in the given order; multiple rule applications are
allowed. Note that we do not have rules for all possible
endings in Russian, but only for the typical ones that are
subject to transformation for adjectives and verbs.
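A minimal sketch of how such ordered, end-of-word substitution rules could be applied is given below. The rule list follows Tables 1 and 2 and the list in this section; the names `ENDING_RULES` and `transform_ending`, and the single ordered pass in which every matching rule fires once, are assumptions about how "in the given order; multiple rule applications are allowed" might be realized.

```python
# Ordered (Russian ending, Bulgarian ending) pairs, as listed in Section 3.2.
# The names and the single-pass strategy are assumptions of this sketch.
ENDING_RULES = [
    ("нный", "нен"), ("ный", "ен"), ("нний", "нен"), ("ний", "ен"),
    ("ий", "и"), ("ый", "и"), ("нной", "нен"), ("ной", "ен"), ("ой", "и"),
    ("ский", "ски"), ("ься", "ь"), ("овать", "ам"), ("ить", "я"),
    ("ять", "я"), ("ать", "ам"), ("уть", "а"), ("еть", "ея"),
]


def transform_ending(word: str) -> str:
    """Apply each rule in order; a rule fires only at the end of the word."""
    for russian, bulgarian in ENDING_RULES:
        if word.endswith(russian):
            # Later rules see the already rewritten word, so several rules
            # can fire in sequence (e.g. ься → ь, then ить → я).
            word = word[: -len(russian)] + bulgarian
    return word
```

With this ordering, `transform_ending("веселиться")` first drops the reflexive morpheme and then rewrites the infinitive ending, giving `веселя`; `transform_ending("висеть")` gives `висея`, reproducing the known over-generation discussed in Section 2.2.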

Since all words have been already lemmatized in the
previous step (if applied), verbs are assumed to be in
infinitive and adjectives in singular masculine form.
Adjective endings are transformed to their respective
Bulgarian counterparts, and reflexive verbs are turned
into non-reflexive. Nouns are not considered since they
generally have the same endings in the two languages
(after having been lemmatized) and thus need no
additional transformations.

Of course, there are many exceptions to the above
rules, but our experiments show that using each of them
has more positive than negative effect. Initially, we tried
using a few additional rules, which were subsequently
removed since they were found to be harmful.

3.3. Removing Double Consonants 

According to Section 2.2, the following substitution rules
are applied to the Russian word:

бб → б; жж → ж; кк → к; лл → л; мм → м; пп → п;
рр → р; сс → с; тт → т; фф → ф
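The rules above amount to collapsing ten specific doubled consonants. A short sketch follows; the names `DOUBLED_CONSONANTS` and `remove_double_consonants` are hypothetical. The doubled д, з, в and н are deliberately left out, since the text notes that they can be legitimate at morpheme boundaries in Bulgarian.

```python
# Consonants whose doubling is collapsed in the Russian word (Section 3.3).
# д, з, в and н are intentionally absent: their doubling may be legitimate.
DOUBLED_CONSONANTS = "бжклмпрстф"


def remove_double_consonants(word: str) -> str:
    """Replace each listed doubled consonant by a single one."""
    for consonant in DOUBLED_CONSONANTS:
        word = word.replace(consonant * 2, consonant)
    return word
```

For example, `remove_double_consonants("процесс")` gives `процес`, while `поддержать` and `непременно` are returned unchanged.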

3.4. Calculating the Modified Levenshtein 
Distance with Weights for Letter Substitution 

Given two words, the Levenshtein distance [Levenshtein, 
1965], also known as the minimum edit distance (MED), 
is defined as the minimum total number of single-letter 
substitutions, deletions and/or insertions necessary to 
convert the first word into the second one. We use a 
modification, which we call modified minimum edit 
distance (MMED), where the weights of all insertions and 
deletions are fixed to 1, and the weights for single-letter 
substitution are as given in Table 3. 

3.5. Calculating MMEDR Value 

At this step, we calculate the MMEDR value by normalizing
MMED: we divide it by the length of the longer word
(the length is calculated after all transformations have
been made in the previous steps). We use the following
formula:

\[ \mathrm{MMEDR}(w_1, w_2) = 1 - \frac{\mathrm{MMED}(w_1, w_2)}{\max(|w_1|, |w_2|)} \]
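The weighted distance of Section 3.4 and the normalization of this section can be sketched as a standard dynamic program. The function names, the `frozenset`-keyed weight dictionary, and the tiny `WEIGHTS` excerpt (only the two Table 3 entries needed for the second example in Section 3.7) are assumptions of this illustration, not the authors' implementation.

```python
def mmed(a: str, b: str, weights: dict) -> float:
    """Levenshtein distance with insert/delete cost 1 and weighted substitutions.

    `weights` maps frozenset({x, y}) to the symmetric cost w(x, y);
    pairs that are not listed cost 1, identical letters cost 0.
    """
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            sub = 0.0 if ca == cb else weights.get(frozenset((ca, cb)), 1.0)
            cur.append(min(prev[j] + 1.0,        # deletion
                           cur[j - 1] + 1.0,     # insertion
                           prev[j - 1] + sub))   # (weighted) substitution
        prev = cur
    return prev[-1]


def mmedr(a: str, b: str, weights: dict) -> float:
    """Normalize MMED by the length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - mmed(a, b, weights) / longest


# Excerpt only: the two Table 3 weights used in the Section 3.7 example.
WEIGHTS = {frozenset("ио"): 0.8, frozenset("ея"): 0.5}
```

Under these assumptions, `mmed("избягам", "отбегам", WEIGHTS)` is 0.8 + 1 + 0.5 = 2.3, and passing an empty weight dictionary gives the classic Levenshtein distance.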

3.6. Calculating the Final Result 

The final result is given by the maximum of the obtained 
values for all eight variants of the MMEDR algorithm - 
with/without lemmatization of the Bulgarian word, 
with/without lemmatization of the Russian word, and 
with/without transformation of the Russian word ending. 
Note also, that lemmatization steps might result in 
calculating additional values for MMEDR - one for each 
possible lemma of the Russian/Bulgarian word. 
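The maximization over the eight variants, including the extra values that arise when a word has several lemmata, could be organized as in the sketch below. Everything here is hypothetical scaffolding: `bg_lemmas` and `ru_lemmas` stand in for the grammatical dictionaries (wordform to list of lemmata), `transform_ending` for step 3, and `score` for steps 4 to 7 (transliteration, double-consonant removal, MMED, normalization).

```python
from itertools import product


def final_similarity(bg, ru, bg_lemmas, ru_lemmas, transform_ending, score):
    """Maximum similarity over all 2 x 2 x 2 variants and all lemma choices.

    bg_lemmas / ru_lemmas: dict mapping a wordform to a non-empty list of
    lemmata (hypothetical stand-ins for the grammatical dictionaries).
    A word missing from the dictionary is used unchanged.
    """
    best = 0.0
    for lemmatize_bg, lemmatize_ru, use_endings in product((False, True), repeat=3):
        bg_forms = bg_lemmas.get(bg, [bg]) if lemmatize_bg else [bg]
        ru_forms = ru_lemmas.get(ru, [ru]) if lemmatize_ru else [ru]
        for b, r in product(bg_forms, ru_forms):
            if use_endings:
                r = transform_ending(r)
            best = max(best, score(b, r))
    return best
```

Taking the maximum means that a rule or a lemma that hurts one variant cannot lower the final score, which matches the motivation given in Section 2.2 for returning the higher of the values.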

3.7. Example 

As we will see below, the proposed MMEDR algorithm
yields significant improvements over classic orthographic
similarity measures like LCSR (longest common
subsequence ratio, defined as the longest common letter
subsequence, normalized by the length of the longer word
[Melamed, 1999]) and MEDR (minimum edit distance
ratio, defined as the Levenshtein distance with all weights
set to 1, normalized by the length of the longer word, also
known as normalized edit distance /NED/ [Marzal &
Vidal, 1993]). This is due to the above-described steps
which turn the Russian word into a Bulgarian-sounding
one and the application of letter substitution weights that
reflect the closeness of the corresponding phonemes.

Let us consider for example the Bulgarian word
афектирахме and the Russian word аффектировались.
Using the classic Levenshtein distance, we obtain the
following: MED(афектирахме, аффектировались) = 7.
And after normalization: MEDR = 1 - (7/15) = 8/15 ≈ 53%.
In contrast, with the MMEDR algorithm, we first
lemmatize the two words, thus obtaining афектирам and
аффектировать respectively. We then replace the
double Russian consonant -фф- by -ф- and the Russian
ending -овать by the first singular Bulgarian verb ending
-ам. We thus obtain the intermediate forms афектирам
and афектирам, which are identical, and MMEDR =
100%. Note that some pairs of words like афектирахме
and аффектировались could be neither orthographically
nor phonetically close but could be perceived as similar
due to cross-lingual correspondences that are obvious to
people speaking both languages.

Let us take another example - with the Bulgarian
word избягам and the Russian word отбегать (both
meaning 'to run out'), which sound similar. Using
Levenshtein distance: MED(избягам, отбегать) = 5 and
thus MEDR = 1 - (5/8) = 3/8 = 37.5%. In contrast, with
the MMEDR algorithm, we first transform отбегать to
its intermediate form отбегам and we then calculate
MMED(избягам, отбегам) = 0.8 + 1 + 0.5 = 2.3 and
MMEDR = 1 - (2.3/7) = 47/70 ≈ 67%, which is a much
better reflection of the similarity between the two words.

Thus, we can conclude that, at least in the above two
examples, the traditional MEDR does not work well for
the highly inflectional Bulgarian and Russian. MEDR is
based on the classic Levenshtein distance, which uses the
same weight for all letter substitutions, and thus cannot
distinguish small phonetic changes like replacing я with е
(two phonetically very close vowels) from more
significant differences like replacing a vowel with a
consonant that is quite different.

4. Experiments and Evaluation 

We performed several experiments in order to assess the 
accuracy of the proposed MMEDR algorithm for 
measuring the similarity between Bulgarian and Russian 
words in a literary text. 

4.1. Textual Resources 

We used the Russian novel The Lord of the World
(Властелин мира) by Alexander Belyayev [Belayayev,
1940a] and its Bulgarian translation by Assen Trayanov
[Belayayev, 1940b] as our test data. We extracted the first
200 different Bulgarian words and the first 200 different
Russian words that occur in the novel, and we measured
the similarity between them.



#       Bulgarian word   Russian word   MMEDR    Sim   Precision   Recall
1       беляев           беляев         1.0000   Yes   100.00%     0.68%
2       на               на             1.0000   Yes   100.00%     1.37%
3       глава            глава          1.0000   Yes   100.00%     2.05%
4       кандидат         кандидат       1.0000   Yes   100.00%     2.74%
5       за               за             1.0000   Yes   100.00%     3.42%
6       наполеон         наполеоны      1.0000   Yes   100.00%     4.11%
7       не               не             1.0000   Yes   100.00%     4.79%
8       ми               нас            1.0000   No    87.50%      4.79%
9       ми               мои            1.0000   Yes   88.89%      5.48%
10      ми               мы             1.0000   Yes   90.00%      6.16%
...
93      четвъртият       четвертым      0.9375   Yes   94.57%      59.59%
94      оставят          остается       0.9286   Yes   94.62%      60.27%
...
39998   са               в              0.0000   No    0.37%       100%
39999   са               к              0.0000   No    0.37%       100%
40000   боядисвали       к              0.0000   No    0.37%       100%

Table 4 - Results of the MMEDR algorithm.

4.2. Grammatical Resources 

We used two monolingual dictionaries for lemmatization: 

• A grammatical dictionary of Bulgarian, created at
the Linguistic Modeling Department, Institute for
Parallel Processing, Bulgarian Academy of Sciences
[Paskaleva, 2002]. This electronic dictionary
contained 963,339 wordforms and 73,113 lemmata.
Each dictionary entry consisted of a wordform, a
corresponding lemma, and some morphological and
grammatical information.

• A grammatical dictionary of Russian, created at
the Institute of Russian Language, Russian Academy
of Sciences, based on the Grammatical Dictionary of
A. Zaliznyak [Zaliznyak, 1977]. The dictionary
consisted of 1,390,613 wordforms and 66,101
lemmata. Each dictionary entry consisted of a wordform,
a corresponding lemma, and some morphological
and grammatical information.

4.3. Experimental Setup 

We measured the similarity between all 200x200=40,000 
Bulgarian-Russian pairs of words. Among them, 163 
pairs were annotated as very similar by a linguist who 
was fluent in Russian and a native speaker of Bulgarian; 
the remaining 39,837 were considered unrelated. 

We used the MMEDR algorithm to rank the 40,000
pairs of words in decreasing order according to the
calculated similarity values. Ideally, the 163 pairs
designated by the linguist would be ranked at the top. We
can determine how well the ranking produced by our
algorithm does using standard measures from information
retrieval, e.g., 11-point interpolated average precision
[Manning et al., 2008].
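For readers unfamiliar with the measure, the following sketch computes 11-point interpolated average precision from a ranked list of relevance labels, using the usual textbook definition (the interpolated precision at recall level r is the highest precision observed at any recall of at least r). The function name and the small floating-point tolerance are assumptions of the example.

```python
def eleven_point_iap(ranked_labels):
    """11-point interpolated average precision for a ranked list of booleans.

    ranked_labels[i] is True if the pair at rank i+1 was annotated as similar.
    """
    total = sum(ranked_labels)
    if total == 0:
        return 0.0
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, relevant in enumerate(ranked_labels, 1):
        hits += bool(relevant)
        points.append((hits / total, hits / rank))
    interpolated = []
    for k in range(11):  # recall levels 0.0, 0.1, ..., 1.0
        level = k / 10
        # Highest precision at any recall >= level (tolerance for float noise).
        interpolated.append(
            max((p for r, p in points if r >= level - 1e-12), default=0.0)
        )
    return sum(interpolated) / 11
```

A ranking that places all similar pairs at the top scores 1.0; each dissimilar pair ranked above a similar one lowers the precision at the affected recall levels.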

We compared the MMEDR algorithm with two classic 
orthographic similarity measures: LCSR and MEDR. 
Unfortunately, we could not directly compare our results 
to those in other work, since there were no previous 
publications measuring orthographic or phonetic similari- 
ty between words in Bulgarian and Russian. 
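For reference, the two baseline measures can be sketched as below, assuming their usual definitions: LCSR divides the length of the longest common subsequence by the length of the longer word, and MEDR is one minus the unweighted Levenshtein distance divided by the length of the longer word. The function names are hypothetical and the code is an illustration rather than the implementation used in the experiments.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Unweighted Levenshtein distance (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio, in [0, 1]."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def medr(a, b):
    """Minimum edit distance ratio, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1 - edit_distance(a, b) / longest if longest else 1.0
```

For the illustrative pair първи (Bulgarian) and первый (Russian), which is not taken from the test set, both functions return 0.5: the longest common subsequence has three letters, and three edits are needed over a maximum length of six.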

4.4. Results 

Table 4 shows an excerpt of the ranking produced by the
MMEDR algorithm: the ranked pairs of words along with
their similarity calculated by the MMEDR algorithm, the
corresponding human annotation for similarity (the
column "Sim"), and the precision and recall calculated
over all rows from the beginning of the ranking to the
current row.

Table 5 shows the 11-pt interpolated average
precision for LCSR, MEDR and MMEDR. We can see
that MMEDR outperforms the other two similarity
measures by a large margin: an absolute difference of
about 18 percentage points over MEDR and about 22
over LCSR.



Algorithm    11-pt interpolated average precision
LCSR         69.06%
MEDR         72.30%
MMEDR        90.58%



Table 5 - Comparison of the similarity measuring algorithms. 

5. Discussion 

As Tables 4 and 5 show, the MMEDR algorithm works 
quite well. Still, there is a lot of room for improvement: 

• Bulgarian and Russian inflectional morphologies are 
quite complex, with many exceptions that are not 
captured by our rules. This is probably a limitation 
of the general approach rather than a deficiency of 
the particular rules used: if we are to capture all 
exceptions, we would need to manually specify 
them all, which would require a lot of extra manual 
work. 

• The transformation rules between Bulgarian and 
Russian are sometimes imprecise as well, e.g., for 
very short words or for words of foreign origin. 

• While linguistically motivated, the letter-for-letter
substitution weights we used are ad hoc, and could
be improved. First, while we used symmetric letter
substitution weights in Table 3, asymmetric weights
might work better, e.g., the Bulgarian prefixes раз-
and из- are spelled as рас- and ис- in Russian when
followed by a voiceless consonant. Thus, the
substitution weight for з → с should probably be
higher than for с → з. Second, we could extend the
rules to take into account the local context, e.g.,
changing раз- to рас- could have a different weight
than changing -з- to -с- in general.

• Another potential problem is that we used only
one linguist for the annotation, which might have
yielded biased judgments. To assess the impact of
this potential subjectivity, we would need judgments
by at least one additional linguist.
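The idea of asymmetric substitution weights mentioned in the list above can be sketched with a Levenshtein distance whose substitution costs are looked up by ordered letter pair. This is only an illustration: the function name, the cost table, and the two numeric values are hypothetical placeholders (chosen merely to follow the ordering suggested above) and are not the weights from Table 3.

```python
def weighted_levenshtein(source, target, sub_cost, indel=1.0):
    """Edit distance with direction-dependent substitution costs.

    sub_cost maps (source_letter, target_letter) to a cost; pairs that are
    missing from the table cost 1.0, identical letters cost 0.
    """
    prev = [j * indel for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * indel]
        for j, t in enumerate(target, start=1):
            if s == t:
                replace = prev[j - 1]
            else:
                replace = prev[j - 1] + sub_cost.get((s, t), 1.0)
            curr.append(min(prev[j] + indel, curr[j - 1] + indel, replace))
        prev = curr
    return prev[-1]


# Hypothetical, purely illustrative costs: the two directions differ.
Z, S = "з", "с"
ASYMMETRIC_COSTS = {(Z, S): 0.6, (S, Z): 0.3}

# The prefixes discussed above, built from the same letters as the table.
PREFIX_WITH_Z = "ра" + Z   # раз-
PREFIX_WITH_S = "ра" + S   # рас-

cost_z_to_s = weighted_levenshtein(PREFIX_WITH_Z, PREFIX_WITH_S, ASYMMETRIC_COSTS)
cost_s_to_z = weighted_levenshtein(PREFIX_WITH_S, PREFIX_WITH_Z, ASYMMETRIC_COSTS)
```

With a symmetric table the two costs would coincide; here they differ because the lookup key is an ordered pair. Context-dependent weights could be added in the same way by extending the key with neighboring letters, which this sketch does not attempt.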

6. Related Work 

Many algorithms have been proposed in the literature for 
measuring the orthographic and the phonetic similarity 
between pairs of words from different languages. 

The simplest ones treated words with identical prefixes
as orthographically close [Simard & al., 1992].

Much more popular have been orthographic similarity 
measures based on normalized versions of the Levensh- 
tein distance [Levenshtein, 1965], the longest common 
subsequence [Melamed, 1999], and the Dice coefficient 
[Brew and McKelvie, 1996]. 

Somewhat less common have been phonetic similarity
measures, which compare sounds instead of letter
sequences. Such an approach was first proposed by
[Russel, 1918]. Guy [1994] described an
algorithm for cognate identification in bilingual word lists 
based on statistics of common sound correspondences. 
Algorithms that learn the typical sound correspondences 
between two languages automatically have also been 
proposed: [Kondrak, 2000], [Kondrak, 2003] and 
[Kondrak & Dorr, 2004]. 

Instead of applying similarity measures for symbolic 
strings on the words directly, some researchers have first 
performed transformations that reflect the typical cross- 
lingual orthographic and phonetic correspondences bet- 
ween the target languages. This is especially important 
for language pairs where some letters in the source 
language are systematically substituted by other letters in 
the target language. The idea can be extended further with 
substitutions of whole syllables, prefixes and suffixes. 
For example, Koehn & Knight [2002] proposed manually
constructed transformation rules from German to English
(e.g., the letters k and z are changed to c; and the ending
-tät is changed to -ty) in order to expand lists of
automatically extracted cognates.
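The kind of hand-written rule described above can be sketched as an ordered list of rewrite patterns applied before any string comparison. The rule list below covers only the two quoted examples and assumes that the ending in question is the German -tät; the names RULES and apply_rules are hypothetical, and the code does not reproduce the actual rule set of Koehn & Knight.

```python
import re
import unicodedata

# Ordered (pattern, replacement) rules; hypothetical and deliberately tiny.
RULES = [
    (re.compile(r"tät$"), "ty"),  # ending -tät becomes -ty
    (re.compile(r"[kz]"), "c"),   # letters k and z become c
]


def apply_rules(word):
    """Rewrite a German word toward English spelling using RULES in order."""
    # Normalize so that the umlaut is a single code point, then lowercase.
    word = unicodedata.normalize("NFC", word).lower()
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


EXAMPLE = "Elektrizität"
transformed = apply_rules(EXAMPLE)  # "electricity"
```

After such a transformation, a standard measure like LCSR or MEDR can be applied to the rewritten word and the candidate word in the other language, so that systematic spelling differences no longer count as mismatches.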

Finally, orthographic measures like LCSR and MEDR 
have gradually evolved over the years, enriched by 
machine learning techniques that automatically identify 
templates for cross-lingual orthographic and phonetic 
correspondences. For example, Tiedemann [1999] learned 
spelling transformations from English to Swedish, while 
Mulloni & Pekar [2006] and Mitkov & al. [2007] learned 
transformation templates, which represent substitutions of
letter sequences in one language with letter sequences in
another language.



7. Conclusions and Future Work 

We have described and tested a novel algorithm for
measuring the similarity between pairs of words based on
transformation rules between Bulgarian and Russian. The
algorithm achieved an 11-pt interpolated average
precision of 90.58% and could be
used to identify possible candidates for cognates or false
friends in text corpora. It could also be used in machine
translation systems working on related languages, where it
could help overcome the incompleteness of the translation
dictionaries used in the system.

There are many ways in which we could improve the 
proposed algorithm. For example, we could adapt the al- 
gorithms described in [Mitkov et al., 2007] and [Bergsma 
& Kondrak, 2007] to Bulgarian and Russian and try to 
learn cross-lingual transformation rules for morphemes 
and other sub-word sequences automatically. We could 
then try to combine MMEDR with such rules. 